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ON DIMENSION FOLDING OF MATRIX- OR ARRAY- VALUED 
STATISTICAL OBJECTS 
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We consider dimension reduction for regression or classification 
in which the predictors are matrix- or array-valued. This type of pre- 
dictor arises when measurements are obtained for each combination 
of two or more underlying variables — for example, the voltage mea- 
sured at different channels and times in electroencephalography data. 
For these applications, it is desirable to preserve the array structure 
of the reduced predictor (e.g., time versus channel), but this cannot 
be achieved within the conventional dimension reduction formula- 
tion. In this paper, we introduce a dimension reduction method, to 
be called dimension folding, for matrix- and array-valued predictors 
that preserves the array structure. In an application of dimension 
folding to an electroencephalography data set, we correctly classify 
97 out of 122 subjects as alcoholic or nonalcoholic based on their 
electroencephalography in a cross-validation sample. 

1. Introduction. In many contemporary statistical applications, the sam- 
pling unit of data is in the form of a matrix- or array-valued object, such 
as an image, a video clip or an electroencephalography (EEG). Such data 
sets share two distinct characteristics: they are large, usually containing gi- 
gabytes of information, and they are structured, with each dimension of the 
random arrays (e.g., the rows and columns of a random matrix) representing 
information of a different nature. The exploration, reduction, comprehension 
and analysis of such large data sets, treating each array as an observation 
while preserving its structure, produce a fresh challenge for data analysis. 
In this paper, we propose a new method, to be called dimension folding, to 
deal with such types of data sets. 
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Our method is motivated by a study of an EEG data set which con- 
cerns the relationship between genetic predisposition and tendency for alco- 
holism (http : //kdd. ics .uci . edu/databases/eeg/eeg. data.html). The 
study involved two groups of subjects: an alcoholic and a control group. 
Each subject was exposed to a stimulus while voltage values were measured 
from 64 channels of electrodes placed on the subject's scalp for 256 time 
points. The full data set requires about 3 gigabytes of memory. We are in- 
terested in the association between alcoholism and the pattern of voltage 
over times and channels. 

Figure 1 shows a typical EEG pattern for an alcoholic (upper panel) and a 
nonalcoholic (lower panel) subject, where time and channel are represented 
by two horizontal axes and voltage is represented by the vertical axis. It is 
clear that the EEG has different patterns for the two groups. We would like 
to represent these different patterns in low dimension for better comprehen- 
sion and classification. 

In mathematical terms, the predictor is a random matrix X of dimension 
PL x PR, an d the response is a random variable Y — in this case, a binary 
random variable indicating whether or not a subject is alcoholic. We are 
interested in reducing the dimension of X as much as possible while pre- 
serving the (nonparametric) regression relation between Y and X. Without 
any structural restriction on the reduced predictor, the dimension reduc- 
tion problem is no different from the conventional dimension reduction for 
vector-valued predictors. That is, one can simply treat the matrix X as a 
vector and consider the problem 



Here, vec(X) denotes the p^pR-dimensional vector obtained by stacking the 
columns of X and r] is a plPr x d nonrandom matrix with d < plPr- This 
is the classical dimension reduction problem to which all of the existing 
methods apply; see, for example, Li (1991, 1992), Duan and Li (1991), Cook 
and Weisberg (1991), Cook (1994, 1996, 1998). 

However, there are practical reasons not to treat the matrix X as the vec- 
tor vec(X). First, problem (1) does not preserve the original matrix struc- 
ture of the predictor, and so important aspects of interpretation may be 
lost. For example, for the EEG data, each column of X represents a time 
point and each row represents a channel. It would be desirable for the re- 
duced predictors to still represent time and channel so that, for example, 
we can locate particular channels or time patterns that characterize the al- 
coholic tendency most distinctively. But a predictor of the form T7 T vec(X) 
will have lost such an interpretation. Second, treating X as a matrix rather 
than a vector greatly reduces the number of parameters needed in dimension 
reduction, which enhances the accuracy of the estimated predictor. 



(1) 
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Fig. 1. Perspective plots for the alcoholic group (upper panel) and the control group 
(lower panel). 

In this paper, we give a theoretical formulation and develop estimation 
procedures for dimension reduction problems with matrix- or array-valued 
predictors, which preserve the interpretations of the underlying variables. 
Suppose that there are matrices a and f3, each with more rows than columns, 
such that Y is independent of X given ct T X/3. In symbols, 

(2) y_LLX|a T X/3. 

We then only need to know the smaller matrix a r X/3 to predict, or classify, 
Y. Meanwhile, a r X/3 preserves the interpretations of channels and times — 
its rows representing linear combinations of channels, or principal channels, 
and columns representing linear combinations of times, or principal times. 
Such information is clearly helpful: for example, we can use the linear co- 
efficients of the principal channel(s) to assess which parts of the brain is 
associated with alcoholism. 
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Letting ® denote the Kronecker product, relation (2) is equivalent to 
(3) Y JLvec(X)j(/3®Q) T vec(X). 

The challenge of dimension reduction problem (2) is that the matrix rj in 

(1) cannot, in general, be written as the matrix f3 ® a in (3). The essence 
of our approach is to seek the smallest dimensional q t X/3 so that (i) t] T X. 
is measurable with respect to a T ~K(3 and (ii) the conditional independence 

(2) is preserved. We will also extend our results to array-valued predic- 
tors. We refer to our method as dimension folding to emphasize its array- 
preserving nature and to distinguish it from the conventional dimension 
reduction methods for vector- valued predictors. 

In Section 2, we present the theoretical formulation and development 
of dimension folding. In Section 3, we introduce the key notion of the 
Kronecker envelope, which provides the guiding principle for constructing 
dimension-folding estimators from conventional dimension reduction esti- 
mators. In Sections 4 and 5, we develop three basic dimension- folding tech- 
niques: folded sliced inverse regression, folded sliced average variance estima- 
tion, and folded directional regression. In Section 6, we outline the extension 
to array-valued predictors. In Section 7, we make simulation comparisons be- 
tween different dimension-folding methods, and between dimension-folding 
methods and conventional dimension reduction methods. In Section 8, we 
apply dimension folding to the aforementioned EEG data. 

2. Dimension-folding subspaces. First, let us introduce some notation 
and terminology. For a p x q matrix A, span(A) stands for the subspace 
of MP spanned by the columns of A and Pa stands for the orthogonal 
projection onto span(A), that is, Pa = A(A T A)^A T , where f denotes the 
Moore-Penrose inversion. If S is a subspace of W and A is a matrix of full 
column rank such that span(A) = S, then we say that A is a basis matrix of 
5. Moreover, P5 stands for the projection onto 5, that is, P5 = Pa, where 
A is any basis matrix of S. For a positive integer p, I p denotes the p x p 
identity matrix. 

Suppose that there are matrices a € W LXqL and (3 € W RXqR , with qi < 
PL and qn < pn, such that (2) holds. This is then equivalent to 

Y ALX\(aA L ) T X((3A R ), 

whenever G WL qLXqL and Ar € M qRXqR are nonsingular. In other words, 
relation (2) depends on a and (3 only through their respective column spaces, 
span(o:) and span(/3). Thus, the identifiable parameters of this problem are 
column spaces of a and f3, rather than o: and j3 themselves. 
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Definition 1. If there exist a subspace Sl C W l and a subspace Sr of 
R dR such that 



then Sl is called a left dimension-folding subspace for Y\X. and Sr is called 
a ng/ii dimension-folding subspace for F|X. 

Under mild regularity conditions, it can be shown that if Sl and S' L are 
two left dimension reduction spaces for Y"|X, then 5^ D S' L is itself a left 
dimension reduction space. The same can be said of the right dimension re- 
duction subspace. The situation here is similar to that in the classical setting 
of dimension reduction where, under very mild conditions, the intersection 
of two dimension reduction spaces is itself a dimension reduction space; see 
Cook (1998), Chiaromonte and Cook (2001) and Yin, Li and Cook (2008). 
Because of the similarity, we will omit the proof of this fact in the new con- 
text and take it for granted for the rest of the paper. This closure under inter- 
section makes it possible to achieve maximal dimension folding because the 
intersection of all dimension-folding subspaces is itself a dimension-folding 
subspace. For two subspaces S\ and 52 in M m , let S\ &1S2 denote the linear 
subspace spanned by the vectors { vi <g> V2 : vi € Si , V2 € £2 } • 

Definition 2. Let <Sy| oX (or Syixo) De the intersection of all left (or 
right) dimension-folding subspaces for Y"|X. The subspace 



is called the central dimension-folding subspace and is written as 5y| Xo- 

Let (3 L E W LXdL be a basis matrix of Sy| oX and (3 R 6 W RXdR be a basis 
matrix of iSyixo- It is then easy to see that 



so the right-hand side is an equivalent definition of iSy| oXo . 

Henceforth, we no longer need to discuss any dimension-folding subspace 
that is not minimal in the sense of Definition 2, so, for brevity, we will 
refer to the central dimension-folding subspace simply as the dimension- 
folding subspace. Similarly, we will refer to the conventional central di- 
mension reduction subspace (when X is a vector) simply as the conven- 
tional dimension reduction subspace. Let Sy| V ec(x) be t ne conventional di- 



mension reduction subspace of Y versus the random vector vec(X). From 
Y JL vec(X)|(/3 jR <g> /3 L ) T vec(X), we see that 



(4) 



yiLX|p 5i xp 5j 



Sy|Xo ® <Sy| oX 




.0 



span(/3 fl ) ® span(/3 L ) = span(/3 i? <g> f3 L ) 



(5) 



<Sy| vcc(X) ^ £y|oXo- 



G 



B. LI, M. K. KIM AND N. ALTMAN 



However, the opposite relation, >Sy|oXo Q <Sy|vec(x)) does not generally hold. 
This means that if we do not wish to preserve the matrix structure of X, 
then it is possible to further reduce the dimension of X. However, 5y| Xo is 
the best that we can do if the reduced predictor is to preserve the matrix 
form. The following examples help to fix the idea. 

Example 1. Let di = dn = 2 and pi = Pr = P- The response Y is a 
Bernoulli random variable with success probability equal to 7r; the condi- 
tional distribution of X given Y is multivariate normal with conditional 
mean 

E(X\Y = 0) = pxp , E(X\Y = l)=( ^ 2 °2x( P -2) \ 

where \x ^ and rxs is an r x s matrix with all of its elements equal to 0. 
The conditional variances are specified by 

r 2 , (i,j)eA, 



vnr(X ij \Y = l) = i J 



where tr/r and A is the index set {(1,2), (2, 1)}. We assume that cov (Xij, 
X V jt) = whenever / (i',f). 

Using Bayes' theorem, we can deduce that the conditional probability 
P(Y = 1|X) [and hence also P(Y = 0|X)] is a function of X n + X 22 , Xf 2 
and X%i- So, if we let be the p-dimensional vector whose ith. element 
is 1 and other elements are 0, then the conventional dimension reduction 

2 

subspace 5y| vcc (x) is spanned by the following three vectors in W : 

ei ®ei + e 2 ®e 2 , e x ® e 2 ,e 2 (g) ex. 

In the mean time, since the smallest submatrix of X that contains Xn + 
X 2 2,Xi2 and X 2 \ is 

Xn X\2 
X21 X22 

the central dimension- folding subspace 5y| x is spanned by 
(6) ei®ei, ei(g>e 2 , e 2 ®ei, e 2 (g>e 2 . 

Thus, in this case, 5y| vec (x) is a proper subspace of <Sy| x - 

The next example illustrates a situation where 5y| vec (x) and 5y| x co- 
incide. 
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Example 2. If we choose the index set A in the definition of var(Xjj \Y) 
in Example 1 to be {(1, 1), (1, 2), (2, 1)}, then it can be shown that P(Y = 
1|X) is a function of X^X^X^i and X\\/t 2 + X22. Thus, both <Syi vec (x) 
and <Sy| xo are spanned by the set of vectors in (6). 

The subspace Sy| Xo en j°ys an invariance property similar to that of a 
conventional dimension reduction subspace; see Cook (1998), Proposition 
6.4. 

Proposition 1. Let Z = A T XB, where A and B are nonsingular ma- 
trices in W' lXPl and K PrXPr , respectively. Then 

<Sy\oZo = (B -1 <g> A _1 )5y| oX o- 

Proof. Let (3 L and (3 R be basis matrices 5y| x and <Sy|x 0) respectively. 
Because Z and X have one-to-one correspondence, we have the following 
equivalences: 

Y ALX.\plX/3 R e> Y JLX|/3lA~ T A T XBB~ 1 /3 R 

& yjLZKA-^fzB- 1 ^. 

Thus, span(A _1 /3 L ) = A _1 iSy| oX is a left dimension reduction space for Y\Z 
and span(B~ 1 /3 R ) = B Syixo is a right dimension reduction space for Y\Z. 
Consequently, 

<Sy|oZ Q A <Sy|oX) >5y|z C B <Sy|Xo- 

By the same argument, <Sy| x Q A5y| oZ and <Sy|Xo f= B5y| Zo , which com- 
pletes the proof. □ 

3. Kronecker envelopes and dimension folding. We now introduce the 
notion of the Kronecker envelope of a random matrix, which plays a key 
role in constructing dimension- folding estimators. 

Theorem 1. Let U be an (r^rx) x k random matrix for some positive 
integers ru, r R and k. There then exist sub spaces S Q \j and S\j a ofW' R and 
W L , respectively, such that: 

1. span(U) C <Suo ^Sou almost surely; 

2. if there exists another pair of subspaces Sr S W r and Sl & K ri that 
satisfies condition 1, then S\j <g> <S u £j Sr <g> Sl- 

The random matrix U, as well as the integers rL,rR,k, are related to 
specific dimension-folding methods to be described later. For example, for 
folded-SIR, U = X!" 1 £7[vec(X)|Y], in which case tr = pr, tl = Pl and k= 1. 
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For folded-SAVE, U is the random matrix X — £ 1 var(vec(X)|Y)5] . 
In this case, tr = pr, tl = Pl and k = plPr- 

Proof of Theorem 1. First, we note that there always exist Sr C R r « 
and Sl ^ K ri so that span(U) C Sr ® Sl because we can simply take Sr = 
W R and Sl = W L . Thus, the following collection of subspaces is nonempty: 

3 = {Sr © S L : span(U) <^Sr®S l ,SrQ R r *,S L C M rL }. 

We will show that 5" is a 7r-system [Billingsley (1986), page 36], that is, the 
intersection of any two members of J is a member of 

Let Sr © Sl and Sr (g) Sl be two members of Evidently, span(U) C 
(Sr ® <Sx) n (<S« ® <Sl). We now show that 

(7) <8> 5 L ) n © <S L ) = (S R n S L ) ® (5 R n S L ) . 

For two orthogonal subspaces, say <S, 5', we use «S®«S' to denote the subspace 
spanned by the vectors in S and S' . Let P r, ~Pr, P* r be the projections onto 
<Sr, Sr, Sr C\Sr, respectively, and let Pi,Pi,P2 be the projections on to 
Sl, Sl, Sl^\Sl, respectively. Then 

S R ® S L = [P*rSr © (P fl - P r )Sr] ® [P^5 L © (P L - P£)<Si] 

= (P^<S« © P^5 L ) © [P^S* © (Pi - P* L )S L ] 

®[(Pr-Pr)S r ®PISl] 

(B[(Pr-P r )Sr®(Pl-P*l)Sl} 
= (P r Sr®P* l Sl)®A. 

Similarly, 

Sr ®S L = (P*rSr © P* L S L ) © [P*rSr ® (Pi - P£)<Si] 

ffi[(P fl -P^)5 fl ®P25i] 
© [(P fl - P£)S B © (Pi - P* L )S L ] 
= (P* r Sr®P*lS l )®B. 

Note that 

(8) P* R S R © P£S £ = P^ © P* L S L = (S R n S R ) ®(s L ns L ). 

We claim that there is no nonzero common element of A and B, that is, 
AnB = {0}. This is because, by construction, 

l(p R - p r )s r ] n [(Pr - p r )s r ] = {o}, 
[(p L -P2)5i]n[(Pi-P2)5i] = {0}. 
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It follows that 

[(P* R S R ® P* L S L ) ®A]n [(P R S R ® P£S L ) ©B] = P^«S fi ® P^5 L . 

This, combined with (8), proves equality (7). Hence, J is a 7r-system. 

Let 5uo ® S \j be any member of $ that has the smallest dimension. It 
then satisfies condition 1 of the theorem. Let Sr fi) Sl be any member of 5 r . 
Then (5r <g> Sl) H (5uo ® <S u) is also a member of 5- Hence, 

dim[(5_R ® Sl) l~l (<Suo <8) Sou)] = dim(Suo ® <S u), 

which implies that 5uo<S>5oU — $R®Sl- Thus, S\j ®S \j satisfies condition 
2, which completes the proof. □ 

This theorem justifies the following definition of a Kronecker envelope. 

Definition 3. The Kronecker product space S\j Q <S><S u i n Theorem 1 
is called the Kronecker envelope of U and is written as £®(U). 

Theorem 1 guarantees that £®(U) exists and is uniquely defined. Note 
that a Kronecker envelope is defined with respect to fixed positive integers 
tl and tr. Therefore, a fully rigorous terminology should be "Kronecker 
envelope of U with respect to integers (tl, r^)." However, in our subsequent 
discussions, tl and tr will be clear from the context — they will be the 
numbers of rows and columns of a random matrix from which U is derived. 
For this reason, we will drop this qualification. 

Note that a vector v 6 W RrL is orthogonal to span(U) almost surely if 
and only if £[(v T U) 2 ] = v T £[(UU T )]v = 0. Hence, span[£(UU T )] is the 
smallest linear subspace that contains the random subspace span(U) almost 
surely. If we use S\j to denote span[S(UU T )], then Theorem 1 and Defi- 
nition 3 can both be stated with respect to S\j. Specifically, the condition 
"span(U) C Sjj ® 5 u almost surely" in Theorem 1 can be replaced by 
"5u C S\j (g> <S u" without changing the content of the theorem. In the fol- 
lowing, we will say £®(U) is the Kronecker envelope of U or that of S\j 
interchangeably. 

In the context of conventional dimension reduction [where X is a vector 
and I] = var(X)], Cook, Li and Chiaromonte (2007) introduced the notion 
of Xl-envelope as the smallest reducing subspace of S that contains the 
dimension reduction space Syix! see also Cook, Li and Chiaromonte (2009). 
Their purpose was to preserve the eigenstructure of S so as to efficiently 
handle the singularity of S. While the purpose and meaning of the Kronecker 
envelope differ from those of the S-envelope, they both serve to impose extra 
structure on a dimension reduction (or folding) subspace, with the former 
imposing an eigenstructure and the latter imposing a Kronecker-product 
structure. 
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The next theorem is the theoretical basis for all of the dimension folding 
methods that will be described in the subsequent sections. 

Theorem 2. Suppose that U is a random matrix in W° LPRXk such that 
span(U) C Sy\ vec(x) almost surely. Then 

and, consequently, S jj C <Sy| x and iStjo C <Sy |Xo- 

Proof. By (5), span(U) C 5y|x <8> <?y| x almost surely. Hence, S u C 

5y|oX an d Sua C 5y|Xo- n 

Theorem 2 means that if we can find a random vector or a random matrix 
U whose column space lies almost surely within the conventional dimension 
reduction space <Sy| ve c(x)> then its Kronecker envelope is a subspace of the 
dimension-folding subspace 5y| Xo- This is the fundamental principle by 
which we will construct estimates of <Sy | Xo- Many estimators for the con- 
ventional dimension reduction space, especially those based on conditional 
moments of X given Y, correspond to such random vectors or matrices. 
Thus, to estimate the dimension-folding subspace, all we need to do is to 
estimate the Kronecker envelope of the relevant random vectors or matrices 
which give rise to the conventional dimension reduction estimators. 

We shall focus on three conventional dimension reduction estimators: 
SIR, SAVE and DR. In fact, using the same principle, we can develop 
dimension- folding methods in conjunction with all existing moment- 
(or conditional-moment-) based conventional methods, such as those devel- 
oped in Zhu and Fang (1996), Bura and Cook (2001), Fung et al. (2002), 
Li (1992), Cook and Li (2002, 2004), Yin and Cook (2002), Ye and Weiss 
(2003), Ferre and Yao (2005) and Li, Zha and Chiaromonte (2005). 

4. Objective functions for Kronecker envelopes. In this section, we in- 
troduce a general objective function whose minimization gives the Kronecker 
envelope, which will guide us in the construction of sample estimates of Kro- 
necker envelopes. 

4.1. Conventional dimension reduction estimators. We first review some 
basic facts about SIR, SAVE and DR in the conventional setting. Let X be 
a p-dimensional random vector and 5] = var(X). Let f3 be a basis matrix of 
5y|x- SIR is based on the fact that if 

(9) £(X|/3 T X) is linear in /3 T X, 

then the random vector 



(10) 



E -1 J B(X|F) 
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belongs to <Sy|x almost surely; see Li (1991). Let (X, Y) be an independent 
copy of (X, Y). SAVE and DR are based on the fact that if, in addition to 
condition (9), we have that 

(11) var(Xj/3 T X) is a nonrandom matrix, 

then the column spaces for the random matrices 
E- 1 [E-var(X|F)] (SAVE), 

(12) 

E~ 1 [2i]-s((x-x)(x-x) T |y,y)] (dr) 

are subspaces of Sy\x almost surely; see Cook and Weisberg (1991) and Li 
and Wang (2007). 

Vectors in the central space are extracted by eigendecompositions corre- 
sponding to relations (10) and (12). For example, for SAVE, let 

A = E[I p - var(Z|y)] 2 where Z = 5T 1/2 X, 

and let vi, . . . , v<i be the eigenvalues of A corresponding to nonzero eigen- 
values. Then {S~ 1 / 2 Vj :i = 1, . . . ,d} spans (at least) a subspace of Syix- 

4.2. General form of the objective function. We now return to the matrix 
predictor case, where X € W lXPr . Let r\ be a basis matrix of the conven- 
tional dimension reduction subspace 5y| vec (x)- By the discussions in Section 
4.1, if E[vec(X.)\r} T vec(X)] is linear in T7 T vec(X), then the random vector 
(10), with X replaced by vec(X) and X redefined as var[vec(X)], belongs 
to 5y| vcc (X) almost surely. If, in addition, var[vec(X)|?7 T vec(X)] is nonran- 
dom, then, with the same replacements, the column spaces of the random 
matrices in (12) are subspaces of 5y| vec (x) almost surely. By Theorem 2, 
we can estimate the dimension-folding subspace 5y| Xo by targeting the 
Kronecker envelopes of the SIR, SAVE and DR estimators of <Sy| vec (x)- 
We refer to the dimension-folding methods thus constructed as folded-SIR, 
folded-SAVE and folded-DR, respectively. 

Again using the folded-SAVE to illustrate the idea, we minimize the ob- 
jective function 

E\\% RVL -var(vec(Z)|Y)] - £ x / 2 (b ® a)f (Y) || 2 , 

where vec(Z) = £™ 1 / 2 vec(X), over matrices a, b and matrix- valued func- 
tions f (•). The matrix S 1 ' 2 in front of (b<g>a) corresponds to the transforma- 
tion of Vj to 5] _1//2 Vj in a conventional procedure. The Kronecker product 
structure is imposed through the regression coefficient matrix b (g> a. The 
next theorem shows that the solution to this minimization problem indeed 
gives the Kronecker envelope of S~ 1 [S — var(vec(X))], the object we desire. 
The theorem is stated sufficiently generally to cover all three methods. 
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Let U be a PrPl x k random matrix, cxq and /3 be the basis matrices of 
<Sou an( A 5uo, respectively, and tul and tur be the dimensions of 5 u and 
S\jo, respectively. For positive integers k\ and k2, and a random vector W 
defined on O w , let L klXk2 (n w ) be the class of functions f : O w -> M felXfc2 
such that £'||f(W)|| 2 < oo, where || • || is the Frobenius matrix norm. 

Theorem 3. Suppose that the elements o/U have finite variances and 
are measurable with respect to a random vector W and that A is a PrPl x 
PrPl nonrandom and nonsingular matrix. Let (a*,b*,f*) be the minimizer 
of 

(13) £||AU- A(b®a)f(W)|| 2 

over all a G W L * mL , b £ RPR xm R and f G L^ LmRXk (fl w ). Then 

span(b*<g>a*) = £®(U). 

Proof. Since span(/3 <g> cto) = f ®(U) and the elements of U are mea- 
surable with respect to W, there is a random matrix 0(W) 6 L2 limflXfc (f2w) 
such that U = (/3 <8> cto)0(W), which is equivalent to 

AU = A(/3 o ®«o)0(W). 

Thus, (13) reaches its minimum within the range of (a, b,f) given in the 
theorem. This implies that any minimizer (a*,b*,f*) of (13) must satisfy 
A(b* <gia*)f*(W) = AU almost surely and, consequently, 

(14) ((3 ® a o )0(W) = (b* ® a*)/*(W) almost surely. 

But this means that span(b* <S> a*) contains U almost surely and has the 
same dimensions as £®(U). The theorem now follows from the uniqueness 
of the Kronecker envelope. □ 

In the general objective function (13), the matrix A is X 1 / 2 for all three 
dimension- folding estimators. The random element W is the random vari- 
able Y for folded-SIR and folded-SAVE; it is the random vector (Y,Y) for 
folded-DR. The random element U is the random vector S~ 1 £'[vec(X)|y] 
for folded-SIR; it is random matrix — var(vec(X)|Y)]E -1//2 for folded- 

SAVE; it is the random matrix 

S-!{2S -£[(vec(X) - var(X))(vec(X) - var(X)) T |y, y]}5T 1/2 

for folded-DR. Note that the f(Y) for folded-SIR is an m^mx,-dimensional 
vector, whereas the f(Y) for folded-SAVE and f(Y, Y) for folded-DR are 
mRiriL x PrPl matrices. 

The construction of the objective function in Theorem 3 expresses condi- 
tional mean in a minimization problem, which echoes the constructions used 
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in Cook and Ni (2005), Li and Dong (2009) and Dong and Li (2009). This 
construction allows us to impose the Kronecker structure on minimization. 

In the context of conventional dimension reduction, Li and Wang (2007) 
showed that both SAVE and DR are exhaustive, that is, the columns of the 
matrices in (12) do not lie in a proper subspace of <Sy|x- It is also known 
that SIR is not exhaustive when the relation between Y and X contains a 
U-shaped trend. Meanwhile, even though SAVE is exhaustive at the popu- 
lation level, the sample estimate is often insensitive to monotone trend. Li 
and Wang (2007) give strong evidence that DR combines the advantages of 
both SIR and SAVE. A dimension-folding method inherits the exhaustive 
property from its conventional counterpart. Specifically, let F n be the em- 
pirical distribution based on the sample (Xi, Yi), . . . , (X n , Y n ) and let Fq be 
the true distribution of (X, Y). We say that a matrix-valued statistics /3(F n ) 
is an exhaustive estimator of a subspace S if span[/3(i ? o)] = S. 

Proposition 2. If (3(F n ) is an exhaustive estimator of the conventional 
central space 5y| V ec(x); then the Kronecker envelope of span[/3(F n )] is an 
exhaustive estimator of the dimension-folding space 5y| Xo- 

Proof. Since (3(F n ) is exhaustive, span[/3(i ? o)] = 5y| V ec(x)- Then £ = 
5®{span[/3(Fo)]} is a Kronecker product space satisfying Y JL X|P£ vec(X). 
It follows that 5y| xo Q £• I n the mean time, since 5y| Xo is a Kronecker 
product space containing vec(X), we have £ C <Sy| x - ^ 

5. Estimation. In this section we develop an algorithm to minimize the 
sample version of the objective function (13). A very appealing property 
of the algorithm is that it can be broken down into iterations among three 
elementary steps, each being essentially least squares. This makes the min- 
imization relatively fast and stable, even for a large number of parameters, 
which is extremely important for our applications, where the number of 
parameters is easily in the thousands. 



5.1. Population-level solution. We need the notion of a commutation ma- 
trix. If A is an n x r2 matrix, then K riir2 is the unique matrix in W" Lr2Xrxr2 
that transforms vec(A) to vec(A T ) : K rijr2 vec(A) = vec(A T ). The explicit 
form and the properties of a commutation matrix can be found in Mag- 
nus and Neudecker (1979). Two properties that will be useful are: (a) if 
A G M riXr2 and B = R^ Kr \ then 

(15) A<8)B = K Pl>r8 (B® A)K r4 , r2 ; 

and (b) for any integer r, K rj i = Ki r = I r . 
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Lemma 1. Let A and B be matrices inM. VlXT2 and W"- iXT4 , where r±, ... 
are positive integers. Then 



where n = I r2 ® [(I r4 ® K rijr3 )K r3r . 4iri ] . 

Proof. Since A ® B = (ai ® B, . . . , a r2 ® B), the vector vec(A ® B) 
consists of vec(ai®B), . . . ,vec(a r2 ®B) stacked together vertically. By (15), 
aj<8>B = K rijJ , 3 (B ® a,j). Hence, vec(aj®B) = (I r4 ®K rijr3 ) vec(B®aj). But 
it is easy to see that vec(B ® an) = vec(B) ® a,j. Apply (15) again to obtain 
vec(B) ® a.i = K. r3r4jri (a, ® vec(B)). Hence, vec(A ® B) becomes 



\ (Ir4 ® K ri jr3 )K r3r4jri ( a r2 (8) vec(B)) / 

which can be written as {I r2 ® [(I r4 ® K rijr3 )K r3r4jri ]}[vec(A) ®vec(B)], as 
desired. □ 

In the following, II is the matrix defined in Lemma 1 with (n, r 2 , r^, r^) 
taken to be (pR,m R ,p L ,m L ), that is, II = I mR ® [(I m£ ® Kp HiPi )Kp imii p fl ]. 

Theorem 4. 1. For fixed f € L™ RmLXk (n w ), a G M^ xm ^, tfte mini- 
mizer of (13) over b G RP* xm fl is b = [£(VjV 2 )] _1 #(VfVi), where 

(17) V 1= vec(AU), V 2 = (f T ® A)II[vec(a) ® I PflmK ]. 

2. For fixed f € L™ flmLXfc (fi w ), b G Rf« xm «, ffc e minimizer of (13) over 
ae ^PLxm. L i s SL= [ErV^V 2 )}~ 1 E(VlV 1 ), where 

(18) V 1= vec(AU), V 2 = (f T (g)A)n[Ip imi ®vec(b)]. 

5. For fixed a G RP£ x?n £ and h G RPfl xm «, #ie minimizer of (13) over 
fe^ flm£Xfc (fi w ) is f(w) = (V^V 2 )- 1 [V 2 r V 1 (w)], w/iere 

(19) V 1 (w)=vec[AU(w)], V 2 = I PhPl ® [A(b ® a)]. 

Proof. By standard calculations, if Vi is an n -dimensional random 
vector and V 2 is an r\ x r 2 -dimensional random matrix, each having finite 
second moments, then the minimizer of 

(20) J3||Vi - V 2 c|| 2 
over all c G W 2 is 



(16) 



vec(A ® B) = II[vec(A) ® vec(B)] 




(ai ® vec(B)) 




(21) 



[^(V^V,)]- 1 ^^^!). 
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We now rewrite the objective function (13) as 

(22) £||vec(AU) - vec[A(b ® a)f] || 2 . 

To prove part 1, note that 

vec[A(b<g>a)f] = (f T <g> A) vec(b ® a) = (f T <g> A)II[vec(b) <g)vec(a)], 

where the second equality follows from Lemma 1. Note that vec(b) ® vec(a) = 
vec[vec(b) vec T (a)] = [vec(a) <g) l PR m R ] vec(b). Hence, 

vcc [A(b <g> a)f] = (f T ® A)n[vec(a) ® 

IpRm R ] vec(b). 

Thus, (22) is of the form (20), with Vi, V2 defined as in (17) and c = vec(b). 
The assertion of part 1 now follows from (21). 

The proof of part 2 is similar to that of part 1 and is thus omitted. Let 
us turn to part 3. For each fixed w, f(w) is the minimizer of 

E[\\AU - A(b ® a)f(Z)|| 2 |W = w] 

(23) 

= ||vec[AU(w)] - [I PRPL ® A(b®a)]vec[f(w)]|| 2 , 

where, since U(w) and f(w) are fixed given W = w, the conditional ex- 
pectation E(-[W = w) disappears. Now, apply (21) to (23) with Vi(w),V2 
defined in (19) and c = vec[f (w)] to complete the proof. □ 

When k = 1, as is the case for folded-SIR, the solution can be further 
simplified. Let a be a vector in M. rs , where r and s are positive integers. 
Thus, a can be written as (a^, . . . ,aJ) T , where each a^ is a vector in W . 
We define mat r (a) to be the r x s matrix (ai,...,a s ). This is an inverse 
operation of vec, in the sense that, for any matrix A € M rxs and any vector 
a G W s , we have 

mat r [vec(A)] = A, vec[mat r (a)] = a. 

Note that the operation mat r is specified by a number r, but no such spec- 
ification is needed for the definition of vec. A useful property of the mat 
operation is that if A e M riXr2 , b G W 2rs and C G W sXr4 for some positive 
integers r\, . . . ,r$, then 

(24) mat ri [(C T <g> A)b] = Amat r . 2 (b)C. 

This can be verified by taking vec on both sides and observing that 

vec[Amat r . 2 (b)C] = (C T <g> A) vec[mat r . 2 (b)] = (C T ® A)b. 

If k = 1, then f is an m/jm^-dimensional vector. So, by (24), (b <g> a)f can 
be written as vec[amat mi (f)b T ], which, in turn, can be written as 

[1 PR (g)amat mi (f)] vec(b T ) = [I PR <g> amat mi (f)]K PRjmR vec(b) 
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or 



[bmat^(f)(g)lpjvec(a). 
Thus, the V2 in (17), (18), (19) in Theorem 4 can be replaced by 



A[I PR ® amat mi (f)]K PRiTnR 



,A[bmat^(f)®I PL ],A(b®a), 



respectively. This alternative expression often requires less computation for 
folded-SIR. 



5.2. Numerical procedures. We now describe the estimation procedures 
for folded-SIR and folded-DR at the sample level. The procedure for folded- 
SAVE is similar to folded-DR and is thus omitted. Let (Xi, Y±), . . . , (X n , Y n ) 
be an i.i.d. sample of (X,Y). We estimate £ by the sample moment 

n 

£ = n" 1 vec(Xj — X) vec T (Xj — X). 

i=l 

As with the conventional dimension reduction methods such as SIR, we 
discretize the response Y. Let Ji, . . . , J s be a partition of S7y . Let D = S(Y) 
be the discrete random variable defined by 

5{Y)=£ ]fY€J e ,£ = l,...,8. 

For a function h of (X,y), let E n h(X.,Y) denote the sample average n _1 x 
Yli=l h(^-i,Y)- We summarize the estimating procedure for the folded-SIR 
as the following five-step algorithm: 

1. generate the initial values of ao G M PiXrrii , {{q(£) :£ = 1, . . . ,s}, say, from 
a sample of the N{0,1) variables; 

2. for £ = 1, . . . , s, compute pi = E n [I(D = £)] and 

Vi(£) = PJ 1 vee{£-V 2 E n [vec(X)I(D = £)]}, 
V 2 (£) = S 1/2 [I Pfl ® aomaWjftW)]^, 
then compute vec(bi) by 



(25) 



I>vT(*)v 2 (*) 



i=l 



3. recompute V 2 (i?) as J] 1//2 [bi mat^ (foCO) ®IpzJ> then compute vec(ai) 
using (25), but with the recomputed V 2 (£); 

4. recompute V 2 as I Pii p L ® [S x / 2 (bi <8> ai)], noting that, at this step, V 2 
does not depend on £, then compute ti(£) = (V^V^^V^V i(.£); 
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5. return to step 2 and iterate, each time using the most recent a, b and f , 
until 

s 

J2M\£~ 1/2 E n [vec(X)\D = £]- S^b ® a)f (£)\\ 2 
i=\ 

stabilizes, then use the resulting a and b as the estimates of olq and /3 . 

The algorithm for folded-DR is similar to folded-SIR, except that the 
Vi and V2 become more complicated. Let V = vec(X), A = V — V and 
D = 5(Y). Then, for k,£= l,...,s, 

E{AA T \D = k,D = £) 

= E(W T \D = k) - E(V\D = k)E(V T \D = i) 
- E(V\D = £)E(V T \D = k) + E{VV T \D = £). 

Let ni,. . . ,n s be the numbers of observations in slices Ji, . . . , J s . The sample 
estimate for the above conditional expectation is 

E n (AA T \D = k,D = £) 



nunc 1 — ' 1 — ' no 

r£j e teJ fe * t£j e 

We now summarize the algorithm for folded-DR: 

1. generate the initial values of ao € W LXmL , {io(k, £) : k,£ = 1, . . . , s} from, 
say, a sample of the iV(0, 1) variables; 

2. for k,£ = 1, . . . , s, compute p^e = n^nijn and 

Vi(M) = vec{5r 1/2 [2S - E n (AA. T \D = k,D = £)}±~ 1/2 }, 
V 2 (k,£) = [f T (M) ® S 1 / 2 ]n[vec(a ) ® 
then compute vec(bi) using the formula 



(26) 



EE^^CM)^, 



,fc=i £=1 



_fc=i 

3. recompute ^2(^5^) as 

V 2 (M) = [f T (M) ® 2 1/2 ]npW ® vec(bi)], 
then compute vec(ai) by (26), using the newly computed \~2(k,£); 
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4. compute f*i (&;,£) by 

fi(fc, I) = [(b <8> ai) T £(bi <8> aOJ-^bi <8> ai) T 

x [2£-£ n (AA T |£> = A;,Z) = £)]£- 1/2 ; 

5. repeat steps 2, 3, 4, using the most updated a, b and f at each step, until 
the objective function 

s 

^2 Pw||S~ 1/2 [2S- E n (VV T \D = k,D = l)}^- 1 / 2 -± 1 / 2 {b®3L)f(l,£)\\ 2 

k,£=l 

stabilizes. 

5.3. Singularity of X. When prPl > n, the sample covariance matrix S 
of vec(X) is singular and does not exist. There are several ways to 
deal with this. One is to replace vec(X) by its principal components. Chi- 
romonte and Martinell (2002) and Li and Li (2004) used this method in the 
conventional setting. If all principal components corresponding to nonzero 
eigenvalues of S are used, then this amounts to using the Moore-Penrose 
inverse in place of S -1 . Another option is to use the ridge-regression- 
type inverse (S + eI pj j Pi ) _1 , where e > 0, in place of see Hoerl (1962) 
and Marquardt (1970). For a related development in conventional dimension 
reduction, see Tyekucheva and Chiaromonte (2008) and Li (2008). Finally, 
it is possible to adapt the iterative transformation approach of Cook, Li and 
Chiaromonte (2007) to dimension folding, but further research is needed in 
this regard. In the subsequent simulations and application, we use the first 
two approaches to handle the singularity of S. 

5.4. Robustness. The dimension-folding methods proposed here are based 
on sample moments, which are known to be sensitive to outliers. Zhou (2009) 
described a weighting scheme to achieve robustness in the conventional set- 
ting for a dimension reduction method derived from canonical correlations 
[Fung et al. (2002)]. We outline how that scheme can be adapted to dimen- 
sion folding. 

For a given sample, let u»(x), x € M PlXPr be a decreasing and nonneg- 
ative function of [vec(x) — E n vec(x)] T S _1 [vec(x) — £' n vec(x)] such that 
^" =1 w(Xj) = 1. To downplay observations lying far away from the cen- 
ter of observed predictors, we replace the empirical measure that assigns 
probability mass 1/n to each pair (Xj,li) with the alternative random mea- 
sure that assigns probability mass u)(Xj) to (X.i,Yi). We then replace the 
usual sample moments and sample conditional moments by moments calcu- 
lated from this alternative measure. For example, for folded-SIR, we replace 
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£ n vec(X), S and E n [vec(X)\D = £] by 

n 

E*vec(Xi) = ^w(Xf)vec(Xt), 

i=l 

^u;(X i )[vec(X 4 ) - vec(X)][vec(X 4 ) - vec(X)] T , 
i=l 

n n 

^ w(Xi) vec(Xi)/(A = *)/ £ t&(Xi) J(A = 

i=l i=l 

The rest of the algorithm remains the same. Folded-SAVE and folded-DR 
can be robustified in a similar fashion. 

6. Array- valued predictors. As mentioned in the Introduction, we some- 
times also encounter sampling units in the form of higher-dimensional arrays. 
For example, a video clip is a three-dimensional array. In this section, we 
extend dimension folding to these cases. For reasons of brevity, we omit the 
details of algorithms, which can be constructed analogously. 

Let X = {Xj l ...j u :ji = 1, . . . ,pi, . . . , j u = 1, . . . ,p u } be a u-way random ar- 
ray of dimension p\ x • • • x p u , and let Y be a scalar- valued random response. 
Our goal is to reduce X to a smaller tt-way array of dimension d\ x • ■ • x d u 
while preserving the regression relation between X and Y . That is, we seek 
nonrandom matrices 

= { a iih :i i = h---,Pi,ji = -,di}, 

OC^ ^ = { a i u j u '■ 2« = 1) • • • iPuiju = 1; ■ • ■ ) d u \, 

such that Y is conditional independent of X given the array 

{Pl Pu "| 

il=l iu=l ) 

Let vec(X) denote the vector of elements of X with its first index changing 
the fastest. That is, 

vec(X) = (A 1 '" 1 , A 2 '" 1 , . . . , A 1 '" 2 , A 2 '" 2 , . . .,X*'"»>) T . 

Parallel to the definition of the mat r operator introduced in Section 5, we 
define arr Pl ... Pu as the inverse operator of vec(X). That is, arr Pl ... Pu [vec(X)] = 
X. The array (27) can then be written as 

arr pi ... Pii [(c^ ® ■ • ■ ® a«) T vec(X)]. 
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Since the above array has a one-to-one relation with <g> ■ ■ -^a^) T vec(X), 
the general dimension-folding problem can be stated as 

(28) Y lLX|(a (u) ® •••<g>a (1) ) T vec(X). 

We note that, as in the matrix-predictor case, the order of 1, . . . , u is reversed 
in the string of Kronecker products: the coefficient matrix associated with 
the last index of X appears first in the string of Kronecker products. 

The central dimension-folding subspace is then defined as the smallest 
subspace 

span(cr u -' ) (g> • • • ® span(ct^ 1 ^ ) 
for which the relation (28) is satisfied. This subspace will be written as 

Sy |X°" ■ 

Once again, the idea is to start with a random matrix whose column space 
lies almost surely in the conventional dimension reduction space 5y| vcc (X) 
and to use the Kronecker envelope of this random matrix to estimate the 
dimension-folding subspace. The next theorem is parallel to Theorem 1 and 
Definition 3, so its proof is omitted. 

Definition 4. Let U be a random matrix in U.(pi-P*) xk . There are sub- 
spaces <Si C MP 1 , . . . , S u C MP n of dimensions t± < pi, . . . , t u < p u such that: 

1. span(U) C Si <8> • • • (8> S u almost surely; 

2. if there exists another it-tuple of subspaces «S[ C MP 1 , . . . ,S' U C W u that 
satisfies condition 1, then 

Si ® • • • (8) S u C S[ (8) • • • (8) S' u 

and the subspace <Si <8> • • • <8> S u is called the Kronecker envelope of U. 

We denote the generalized Kronecker envelope of U by £® _ (U). Using 
an argument similar to that used in Section 3, we can prove the following 
result. 

Theorem 5. Let X be a random array in W lX "' XPu . IfXJisa random 
matrix in M.( pi "' Pu ^ xk whose column space is contained in 5y| vcc (x) almost 
surely, then £p 1 ... Pu (JJ) is contained in <Sy|x°«- 

This theorem provides the guiding principle for estimating the central 
dimension-folding space Sy |x° u • That is, we start with a conventional dimen- 
sion reduction method such as SIR, SAVE or DR for estimating <Sy| vcc ( X ) 
and then construct the estimates of its Kronecker envelope via objective 
functions analogous to (13). 
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7. Simulation studies. In this section, we evaluate by simulation the per- 
formance of the three dimension folding estimators, in a classification prob- 
lem in which the response is a binary variable and the predictor matrices 
corresponding to the two values of Y differ both in location and variation. 
Our comparison is twofold: we compare the performance among the three 
dimension folding methods themselves and compare them with conventional 
dimension reduction methods when the dimension reduction subspace coin- 
cides with the dimension-folding subspace. While dimension-folding meth- 
ods are introduced primarily to preserve the array structure, intuitively, 
they should be more accurate than their conventional counterparts when 
SoXo = <5y| vcc (x) because the former contains far fewer parameters. The 
second comparison is made in order to confirm this intuition. 

To assess the accuracy of a dimension-folding method, we use the criterion 

( 29 ) _P /3®«II> 

where || • || is a matrix norm, which, for example, can be the Frobenius norm 
or the largest singular value. This is a measure of discrepancy between the 
subspaces span(/3* <8> at*) and span(/3§a); see Li, Zha and Chiaromonte 
(2005) for intuition about and further discussion of this criterion. In the 
following, we use the Frobenius norm. 

To make a sensible comparison, it is helpful to define a "benchmark" of 
this discrepancy, that is, its value when the two spaces are not related at 
all. Let a* G M_PL xd L anc [ p* g ygPRX-dR ^ e ranc [ om matrices whose entries are 
i.i.d. standard normal. We define E(\\Pp*& a * — P/3® a ||) to be the benchmark 
distance. The benchmark is easily computed by simulation. It depends on 
dimensions PL,PR,d-L and d,R, but is independent of the model and the esti- 
mator, as well as of a and (3 (despite its appearance). A similar benchmark 
was used in Li, Wen and Zhu (2008) in the classical setting. The performance 
of the conventional dimension reduction methods is assessed similarly. Let 
r) be the conventional dimension reduction estimates of 77, a basis matrix 
for cSy| vcc(X ). We use 

(30) I|P*)-PJ 

to assess the error of the conventional methods. Note that P/3® a = P^, so 
the comparison is on equal footing. 

Example 1 (Continued). Let X and Y be defined as in Example 1 in 
Section 2. We take it = 1/2, a 2 = 0.1 and r 2 = 1.5. Recall that, in this case, 
£y|vec(x) is a proper subset of «Sy| xo- 

We generate n pairs of observations, (Xi, Y\), . . . , (X n , Y n ), from this 
model, with n = 100,200,300,500,800 and p = 5,10. We apply folded-SIR, 
folded-SAVE and folded-DR. Table 1 gives the means of criterion (29), as 
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Table 1 

Comparison among dimension-folding methods 



Method 


n= 100 


n = 200 


n = 300 


n 


. = 500 


n = 800 




PL =PR 


= 5 (benchmark distance = 


2.586) 






Folded-SIR 


1.115 


0.751 


0.598 




0.496 


0.369 


Folded-SAVE 


0.566 


0.295 


0.220 




0.161 


0.121 


Folded-DR 


0.531 


0.287 


0.215 




0.158 


0.119 




PL=PR- 


= 10 (benchmark distance = 


= 2.772) 






Folded-SIR 


2.006 


1.289 


1.034 




0.772 


0.604 


Folded-SAVE 


2.710 


1.410 


0.581 




0.345 


0.236 


Folded-DR 


2.296 


1.019 


0.542 




0.331 


0.230 



calculated from N = 500 simulated samples for each combination of n and p. 
The standard errors of these means are all within 0.02 and are not presented. 
From the table, we can see that the overall best performer is folded-DR, fol- 
lowed by folded-SAVE and folded-SIR. Both folded-DR and folded-SAVE 
perform much better than folded-SIR. This is because the two mixing com- 
ponents for Y = and Y = 1 differ both by location and variance, the latter 
of which cannot be captured by folded-SIR. 

To give a sense of how the methods perform, in Figure 2, we present 
the scatterplot matrices of the four elements of a r Xb (mx = mji = 2), as 
estimated by folded-SIR (left panel) and folded-DR (right panel). We see 
that the four predictors by folded-SIR separate the two groups by location, 
whereas folded-DR separates them by both location and variation, as shown 
in the (All, A22) plot in the lower panel. 

Example 2 (Continued). Let X and Y be defined as in Example 2 in 
Section 2. Again take 7r = 1/2, a 1 = 0.1 and r 2 = 1.5. The difference from 
the previous example is that the index set A for the definition of v&i(Xij\Y) 
is changed to ensure that >Sy| ve c(x) = »Sy|oXo; so ^hat ^he comparison of 
dimension-folding methods and conventional dimension reduction methods 
is on an equal footing. 

With the same choices of n, p and N, in Table 2, we present the means of 
either criterion (29) (for dimension-folding methods) or criterion (30) (for 
conventional dimension reduction methods). We observe very substantial 
improvements by dimension folding. This is due to the fact that the column 
space of (3 ® cc contains far fewer parameters than the column space of r/, 
if both matrices have the same dimension. We also see that folded-SAVE 
and folded-DR perform much better than folded-SIR and the same pattern 
holds for their conventional counterparts, for the same reason explained in 
the previous comparison. 
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-3 -1 1 2 3 -4 -2 2 4 



Fig. 2. Scatterplot matrices for the reduced predictors estimated by folded-SIR (upper- 
panel) and folded-DR (lower panel) for Example 1. 
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Table 2 

Comparison between dimension-folding and conventional dimension reduction 



Method 


n= 100 


n = 200 


n = 300 


n = 500 


n = 800 




PL =PR 


= 5 (benchmark distance = 


2.586) 






Folded-SIR 


1.057 


0.716 


0.580 




0.432 


0.343 


SIR 


1.865 


1.806 


1.758 




1.759 


1.753 


Folded-SAVE 


0.523 


0.287 


0.221 




0.161 


0.123 


SAVE 


1.615 


1.294 


1.089 




0.757 


0.579 


Folded-DR 


0.497 


0.278 


0.215 




0.157 


0.119 


DR 


1.596 


1.289 


1.075 




0.747 


0.574 




PL=PR- 


= 10 (benchmark distance = 


: 2.586) 






Folded-SIR 


1.924 


1.246 


0.984 




0.750 


0.577 


SIR 


2.626 


2.142 


2.051 




1.963 


1.921 


Folded-SAVE 


2.709 


1.085 


0.537 




0.334 


0.234 


SAVE 


2.753 


2.677 


1.956 




1.605 


1.406 


Folded-DR 


2.271 


0.850 


0.505 




0.321 


0.226 


DR 


2.753 


2.503 


1.871 




1.593 


1.392 



8. Application. We now apply the dimension-folding methods to ana- 
lyze the EEG data mentioned in the Introduction. The study involved two 
groups of subjects: an alcoholic group of 77 subjects and a control group of 
45 subjects. Each subject was exposed to either one stimulus or two stimuli. 
During an exposure, the voltage values were measured from 64 channels of 
electrodes and for 256 time points (at 256 Hz per second). The 64 electrodes 
are placed at different locations on the subject's scalp. The stimuli were 
pictures chosen from a picture set. When two pictures were shown, they 
were displayed in either a matched condition, where two pictures were iden- 
tical, or a unmatched condition, where they were different. Each subject 
had 120 trials under these three conditions: single stimulus, two matched 
stimuli and two unmatched stimuli. The primary interest was to study the 
association between alcoholism and the pattern of voltage values over times 
and channels. 

To keep matters simple, in this paper, we use only part of the data set: 
we include only the single stimulus condition and, for each subject, we take 
the average of all the trials under that condition. That is, the portion of 
the data we use consists of (Xi, Yi), . . . , (X122, ^122) j where Xj is a 256 x 64 
matrix with each entry representing the mean voltage value of subject i at 
a combination of a time point and a channel, averaged over all trials under 
the single stimulus condition, and Yj is a binary random variable indicating 
whether the ith subject is alcoholic (Yi = 1) or nonalcoholic (Yi = 0). 

To apply the dimension-folding methods, we need to perform the spectral 
decomposition on the PlPr X PlPr = 16384 x 16384-dimensional matrix X, 
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which is quite large. So, prior to dimension folding, we have implemented 
a somewhat heuristic pre-screening phase. Let vi,...,v Si be the first sl 
eigenvectors of the matrix E n (X. — X)(X — X) r and wi, . . . , w Sfl be the first 
sr eigenvectors of the matrix E n (X. — X) T (X — X). Let V = (vi, . . . , v S£ ) 
and W = (wi, . . . , w s J T . Let X* = V T X,W. 

Using two sets of dimensions, (sl, SR,d,L,dji) = (30,20,2,2) and (30,20, 
1,2), we apply folded-SIR, folded-SAVE and folded-DR to the pre-screened 
data set (X*, Yi), . . . , (X* , Y n ). The results for (d,L,dji) = (2, 2) are presented 
in Figure 3, which contains two scatterplot matrices of the four predictors in 
the 2x2 matrices a T X*b obtained by folded-SAVE (left panel) and folded- 
DR (right panel). The four predictors are labeled as Xll, X12, X21, X22 in 
the plots. A striking feature of these plots is that the EEG data for alcoholic 
cases (represented by red o's) show markedly less variation than those for 
nonalcoholic cases (represented by black +'s). This can be interpreted as 
indicating that the EEG patterns for the alcoholic subjects are more similar 
than those for the nonalcoholic cases. We also observe that folded-SAVE 
predictors show strong separation by variation, but no obvious separation 
by location, whereas folded-DR successfully separates the two clusters by 
both location and variation. The results for (dL,dji) = (1,2) are presented 
in Figure 4, which contains three scatterplots for the two predictors in the 
1x2 matrix a T X*b obtained by folded-SIR (upper panel), folded-SAVE 
(lower- left panel) and folded-DR (lower-right panel). The two predictors are 
labeled as Xll, X12 in the plots. From these plots, we observe the differences 
in performance of the three methods: folded-SIR works well in separating 
locations, folded-DR works well in separating variations, whereas folded-DR 
combines the advantages of both. 

Of course, the ultimate purpose of dimension folding (or more gener- 
ally, dimension reduction) is to assist regression or classification. Thus, the 
true test for the usefulness of our methods is whether they can help us 
to identify whether or not a person is alcoholic using his/her EEG data. 
For this reason, we have performed a classification analysis after dimen- 
sion folding. For each i = 1, ...,n, we withhold the ith subject from the 
sample, treating it as the test set and the remaining 121 subjects as the 
training set. Based on each training set, we first carry out dimension fold- 
ing (including pre-screening) and then apply quadratic discriminant analysis 
[Johnson and Wichern (2007), Chapter 11] to develop a classification rule. 
This classification rule is then used to classify the withheld subject. Using 
(sl,sr, dL,dn) = (15, 15, 1, 2) and the folded-DR, we correctly predicted (as 
alcoholic or nonalcoholic) 97 out of the 122 cases; folded-SIR correctly classi- 
fies 94 out of 122 cases. We also compute the number of correct classifications 
using the conventional SIR, which gives 92 out 122 correct decisions. For the 
conventional SIR, we use (sl,sr) = (9,9) and d = 1. For all three methods, 
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Fig. 3. Scatterplot matrices for the four reduced predictors estimated by folded-SAVE 
(upper panel) and folded-DR (lower panel), for (dL,dn) = (2,2). Red o's represent the 
alcoholic cases; black + 's represent the nonalcoholic cases. 
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Fig. 4. Scatterplots for the two reduced predictors estimated by folded-SIR (upper panel) , 
folded-SAVE (lower-left panel) and folded-DR (lower-right panel) , for (dL,da) = (1,2). Red 
o 's represent the alcoholic cases; black + 's represent the nonalcoholic cases. 



the ridge-regression-type inverse (S + £l PRPL )~ l is used, with e = 0.5. Not- 
ing that we have used only a portion of the data set, it is conceivable that 
an even stronger association could be established if the full data set were 
used. 

The matrix structure preserved by dimension folding is helpful to gain 
further insights into the relationship between EEG patterns and alcoholism. 
In particular, the right dimension-folding subspace contains the weights of 
the channels that are important in predicting alcoholism, whereas the left 
dimension-folding subspace contains the principal patterns of how voltage 
varies over time in the important channels. These could provide important 
information for understanding how each part of the brain and the way it 
responds to stimuli are related to alcoholism. 
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